ml dataset
- Europe > Switzerland > Zürich > Zürich (0.14)
- South America > Paraguay > Asunción > Asunción (0.04)
- Europe > Austria (0.04)
- (13 more...)
- Research Report (0.67)
- Questionnaire & Opinion Survey (0.49)
- Overview (0.46)
RecKG: Knowledge Graph for Recommender Systems
Kwon, Junhyuk, Ahn, Seokho, Seo, Young-Duk
Knowledge graphs have proven successful in integrating heterogeneous data across various domains. However, there remains a noticeable dearth of research on their seamless integration among heterogeneous recommender systems, despite knowledge graph-based recommender systems garnering extensive research attention. This study aims to fill this gap by proposing RecKG, a standardized knowledge graph for recommender systems. RecKG ensures the consistent representation of entities across different datasets, accommodating diverse attribute types for effective data integration. Through a meticulous examination of various recommender system datasets, we select attributes for RecKG, ensuring standardized formatting through consistent naming conventions. Owing to these characteristics, RecKG can seamlessly integrate heterogeneous data sources, enabling the discovery of additional semantic information within the integrated knowledge graph. We apply RecKG to standardize real-world datasets, subsequently developing an application for RecKG using a graph database. Finally, we validate the interoperability achieved by RecKG through a qualitative comparison with other studies.
- Europe > Spain > Castile and León > Ávila Province > Ávila (0.05)
- Asia > South Korea > Incheon > Incheon (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Jiangsu Province > Yancheng (0.04)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Media > Music (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
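The abstract describes representing heterogeneous recommender datasets under one standardized vocabulary so they can be merged. A minimal sketch of that idea (not the authors' code; the relation names `user.interacted_with` and `item.genre` are hypothetical, for illustration only):

```python
# Illustrative sketch: flatten two recommender datasets into
# (head, relation, tail) triples under a shared, standardized vocabulary,
# so entities with the same meaning integrate across sources.

def to_triples(interactions, item_attrs):
    """Flatten a dataset into knowledge-graph triples."""
    triples = []
    for user, item in interactions:
        triples.append((f"user:{user}", "user.interacted_with", f"item:{item}"))
    for item, attr, value in item_attrs:
        triples.append((f"item:{item}", f"item.{attr}", value))
    return triples

# Two heterogeneous sources mapped onto the same relation names:
movie_triples = to_triples([("u1", "m1")], [("m1", "genre", "sci-fi")])
music_triples = to_triples([("u1", "s9")], [("s9", "genre", "jazz")])

# Merging becomes a simple union; "u1" denotes the same entity in both graphs.
merged = set(movie_triples) | set(music_triples)
```

Because both sources share entity and relation naming, the merged graph immediately exposes cross-dataset paths (e.g., the same user connected to both a movie and a song).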
Anticipating Technical Expertise and Capability Evolution in Research Communities using Dynamic Graph Transformers
Horawalavithana, Sameera, Ayton, Ellyn, Usenko, Anastasiya, Cosbey, Robin, Volkova, Svitlana
The ability to anticipate technical expertise and capability evolution trends globally is essential for national and global security, especially in safety-critical domains like nuclear nonproliferation (NN) and rapidly emerging fields like artificial intelligence (AI). In this work, we extend traditional statistical relational learning approaches (e.g., link prediction in collaboration networks) and formulate a problem of anticipating technical expertise and capability evolution using dynamic heterogeneous graph representations. We develop novel capabilities to forecast collaboration patterns, authorship behavior, and technical capability evolution at different granularities (e.g., scientist and institution levels) in two distinct research fields. We implement a dynamic graph transformer (DGT) neural architecture, which advances state-of-the-art graph neural network models by (a) forecasting heterogeneous (rather than homogeneous) nodes and edges, and (b) relying on both discrete- and continuous-time inputs. We demonstrate that our DGT models predict collaboration, partnership, and expertise patterns with 0.26, 0.73, and 0.53 mean reciprocal rank values for the AI domain and 0.48, 0.93, and 0.22 for the NN domain. DGT model performance exceeds the best-performing static graph baseline models by 30-80% across the AI and NN domains. Our findings demonstrate that DGT models boost inductive task performance, when previously unseen nodes appear in the test data, for domains with emerging collaboration patterns (e.g., AI). Specifically, the models accurately predict which established scientists will collaborate with early career scientists, and vice versa, in the AI domain.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Germany (0.05)
- Europe > Italy (0.05)
- (22 more...)
- Energy > Power Industry > Utilities > Nuclear (1.00)
- Government > Regional Government > North America Government > United States Government (0.68)
- Government > Military (0.68)
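The abstract reports mean reciprocal rank (MRR) values such as 0.26, 0.73, and 0.53. As a reminder of what that metric measures, a minimal sketch (not the authors' evaluation code): for each query the model ranks candidate nodes, and MRR averages the reciprocal rank of the true answer.

```python
# Minimal sketch of mean reciprocal rank (MRR): average of 1/rank of the
# true item across queries, where rank is its 1-based position in the
# model's ranked candidate list.

def mean_reciprocal_rank(ranked_lists, true_items):
    total = 0.0
    for ranking, truth in zip(ranked_lists, true_items):
        rank = ranking.index(truth) + 1  # 1-based position of the true item
        total += 1.0 / rank
    return total / len(true_items)

# Two queries: true item ranked 1st, then 4th -> (1 + 0.25) / 2 = 0.625
mrr = mean_reciprocal_rank([["a", "b", "c"], ["b", "c", "d", "a"]], ["a", "a"])
```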
CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
Oliveira, Frederico S., Casanova, Edresson, Júnior, Arnaldo Cândido, Soares, Anderson S., Filho, Arlindo R. Galvão
In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models, consisting of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide the YourTTS model, a multi-lingual TTS model, trained on 3,176.13 hours from CML-TTS plus 245.07 hours of English speech from LibriTTS. Our purpose in creating this dataset is to open up new research possibilities in the TTS area for multi-lingual models. The dataset is publicly available under the CC-BY 4.0 license.
- South America > Brazil (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- (2 more...)
- Information Technology (0.47)
- Media (0.35)
Cleanlab: Correct your data labels automatically and quickly – Towards AI
Originally published on Towards AI. I used an open-sourced library, cleanlab, to remove low-quality labels on an image dataset. The model trained on the dataset without low-quality data gained 4 percentage points of accuracy compared to the baseline model (trained on all data). Improving data quality sounds easy enough. But the workload of manually checking data quality can quickly become insurmountable as the dataset scales.
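The workflow described above, scoring labels with a trained model and dropping the low-quality ones, can be sketched in a self-contained way. This illustrates the idea behind cleanlab's label-quality scoring, not its actual API: score each example by the model's predicted probability of its *given* label (its "self-confidence") and flag the lowest scores as likely label errors; the threshold here is an arbitrary choice for the example.

```python
import numpy as np

def label_quality_scores(pred_probs, labels):
    """pred_probs: (n, k) out-of-sample class probabilities; labels: (n,) ints.
    Returns each example's predicted probability of its given label."""
    return pred_probs[np.arange(len(labels)), labels]

def find_likely_issues(pred_probs, labels, threshold=0.5):
    """Indices of examples whose given label the model doubts."""
    return np.where(label_quality_scores(pred_probs, labels) < threshold)[0]

pred_probs = np.array([[0.9, 0.1],   # labeled 0, model agrees
                       [0.2, 0.8],   # labeled 0, model disagrees -> suspect
                       [0.1, 0.9]])  # labeled 1, model agrees
labels = np.array([0, 0, 1])
issues = find_likely_issues(pred_probs, labels)  # flags example 1
```

In practice, the probabilities should come from out-of-sample (cross-validated) predictions, so the model cannot simply memorize its own bad labels; retraining without the flagged examples is what yielded the accuracy gain described above.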
Elements of effective machine learning datasets in astronomy
Boscoe, Bernie, Do, Tuan, Jones, Evan, Li, Yunqi, Alfaro, Kevin, Ma, Christy
In this work, we identify elements of effective machine learning datasets in astronomy and present suggestions for their design and creation. Machine learning has become an increasingly important tool for analyzing and understanding the large-scale flood of data in astronomy. To take advantage of these tools, datasets are required for training and testing. However, building machine learning datasets for astronomy can be challenging. Astronomical data is collected from instruments built to explore science questions in a traditional fashion rather than to conduct machine learning. Thus, it is often the case that raw data, or even downstream processed data is not in a form amenable to machine learning. We explore the construction of machine learning datasets and we ask: what elements define effective machine learning datasets? We define effective machine learning datasets in astronomy to be formed with well-defined data points, structure, and metadata. We discuss why these elements are important for astronomical applications and ways to put them in practice. We posit that these qualities not only make the data suitable for machine learning, they also help to foster usable, reusable, and replicable science practices.
- North America > United States > California > Los Angeles County > Los Angeles (0.17)
- Europe > Netherlands (0.04)
- Asia > Japan (0.04)
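The abstract names three elements of an effective dataset: well-defined data points, structure, and metadata. A hedged illustration of how those elements might be organized in code; the field names and values are invented for the example, not taken from any real survey:

```python
from dataclasses import dataclass, field

@dataclass
class AstroDataset:
    data: list                 # well-defined data points (e.g., flux arrays)
    labels: list               # structure: targets aligned with data points
    metadata: dict = field(default_factory=dict)  # provenance, units, splits

ds = AstroDataset(
    data=[[21.3, 20.9], [19.8, 19.5]],      # e.g., magnitudes in two bands
    labels=["galaxy", "star"],
    metadata={"bands": ["g", "r"], "units": "AB mag", "source": "simulated"},
)
assert len(ds.data) == len(ds.labels)  # structure: every point has a target
```

Keeping units and provenance in the metadata, rather than implicit in a file name, is one concrete way such a dataset supports the reusable, replicable practices the paper advocates.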
cleanlab 2.0: Automatically Find Errors in ML Datasets
Distributed ML is an active area of work, in both academia and industry, and has been for some time now; companies like Google were doing distributed machine learning decades ago. For some use cases, libraries like scikit-learn are entirely adequate. For others, such as training sophisticated models that require substantial compute, or training over datasets that don't fit on a single node, distributed computing is essential. On the topic of data storage: in some cases, system builders co-design the data storage and the data processing, and such co-design can give performance gains.
- Information Technology > Artificial Intelligence > Machine Learning (0.65)
- Information Technology > Data Science > Data Mining > Big Data (0.45)
Play With Your ML Dataset -- Cheatsheet in R
Understanding your data is usually half the battle won. For any machine learning project, it helps immensely to analyze your data from different points of view. Summarising a dataset means understanding how your data looks when subjected to simple statistical analysis. To illustrate the various techniques, let us consider the glass dataset from the R package mlbench. It has 214 observations containing examples of the chemical analysis of 7 different types of glass. For a quick look, we display the first 10 rows of the data.
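The article's first steps are in R (roughly `head(Glass, 10)` and `summary(Glass)` on mlbench's glass data). The same moves translate directly to Python/pandas; a sketch on a small synthetic frame, since the mlbench dataset is not bundled with Python and the column values below are illustrative, not the real measurements:

```python
import pandas as pd

df = pd.DataFrame({
    "RI": [1.52, 1.51, 1.52, 1.53],   # refractive index (illustrative values)
    "Na": [13.6, 13.9, 13.5, 13.2],   # sodium content (illustrative values)
    "Type": [1, 1, 2, 3],             # glass type label
})

first_rows = df.head(10)   # R: head(Glass, 10) -- here only 4 rows exist
stats = df.describe()      # R: summary(Glass) -- count/mean/std/quartiles
```

`describe()` gives the same at-a-glance summary as R's `summary()`: per-column counts, means, spreads, and quartiles, which is often enough to spot skew, outliers, or suspicious constants before modeling.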